Add columnar data access for memory-efficient row processing #975
    }

    /** Interface for accessing column values by index without materializing the entire column. */
    private interface ColumnAccessor {
Use separate files for the interface and the implementation.
    if (column.isSetStringVal()) return column.getStringVal().getValuesSize();

    throw new DatabricksSQLException(
        "Unsupported column type: " + column, DatabricksDriverErrorCode.UNSUPPORTED_OPERATION);
What about complex data types? Will they also be covered by the primitive types above?
We only support these columns. Nothing is added or removed in these changes. If complex types arrive as binary (which I think is the case), they are supported; otherwise they are not, and this is the current behaviour too.

     * out of bounds
     */
    @Override
    public Object getObject(int columnIndex) throws DatabricksSQLException {
Will this work out of the box? You return primitive types from ColumnAccessor, but here we can have complex types as well. Will the conversion happen implicitly?
There is a binary type as well. I added more details in this comment: #975 (comment).
Pull Request Overview
This PR introduces a memory-efficient columnar data access mechanism for JDBC result processing. Instead of materializing entire result sets into List<List<Object>> structures, it provides direct access to columnar data through a new ColumnarRowView class, resulting in significant memory reduction (up to 91% in testing) and improved CPU performance.
- Introduces `ColumnarRowView` class for memory-efficient row-by-row data access
- Updates `LazyThriftResult` to use columnar views instead of materialized row lists
- Adds utility method in `DatabricksThriftUtil` for creating columnar views
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ColumnarRowView.java | New class providing direct columnar access with getValue(row, col) method and null handling |
| LazyThriftResult.java | Refactored to use ColumnarRowView instead of `List<List<Object>>` for batch processing |
| DatabricksThriftUtil.java | Added createColumnarView() utility method as memory-efficient alternative |
| ColumnarRowViewTest.java | Comprehensive test coverage for all column types, null handling, and boundary conditions |
    private final BitSet nullBits;

    public TypedColumnAccessor(List<T> values, byte[] nulls) {
      this.values = values;
      this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
    }

    @Override
    public Object getValue(int rowIndex) {
      if (nullBits != null && nullBits.get(rowIndex)) {
        return null;
Creating a new BitSet for every column accessor could be memory-intensive for large datasets. Consider lazy initialization or caching the BitSet creation to improve memory efficiency.
Suggested change:

    - private final BitSet nullBits;
    - public TypedColumnAccessor(List<T> values, byte[] nulls) {
    -   this.values = values;
    -   this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
    - }
    - @Override
    - public Object getValue(int rowIndex) {
    -   if (nullBits != null && nullBits.get(rowIndex)) {
    -     return null;
    + private final byte[] nulls;
    + private BitSet nullBits;
    + public TypedColumnAccessor(List<T> values, byte[] nulls) {
    +   this.values = values;
    +   this.nulls = nulls;
    +   this.nullBits = null; // Lazy initialization
    + }
    + @Override
    + public Object getValue(int rowIndex) {
    +   if (nulls != null) {
    +     if (nullBits == null) {
    +       nullBits = BitSet.valueOf(nulls);
    +     }
    +     if (nullBits.get(rowIndex)) {
    +       return null;
    +     }
Nice suggestion. Thanks. Will implement in subsequent PR.
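The lazy null-bitmap idea discussed above can be sketched as a small standalone class. This is only an illustration under stated assumptions: `LazyNullAccessorSketch` and its `sample()` helper are hypothetical names, not the PR's `TypedColumnAccessor`, and the truncated suggestion is completed here with a plain `values.get(rowIndex)` fallback. It relies on the documented behaviour of `BitSet.valueOf(byte[])`, where bit `n` is read from byte `n / 8` at bit position `n % 8`.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of a column accessor that builds its null BitSet on first use. */
final class LazyNullAccessorSketch<T> {
  private final List<T> values;
  private final byte[] nulls; // raw null bitmap, may be null when the column has no nulls
  private BitSet nullBits;    // created lazily; this sketch is not thread-safe

  LazyNullAccessorSketch(List<T> values, byte[] nulls) {
    this.values = values;
    this.nulls = nulls;
  }

  /** Returns null when the row's bit is set in the bitmap, otherwise the stored value. */
  T getValue(int rowIndex) {
    if (nulls != null) {
      if (nullBits == null) {
        nullBits = BitSet.valueOf(nulls); // bit n lives in byte n / 8 at position n % 8
      }
      if (nullBits.get(rowIndex)) {
        return null;
      }
    }
    return values.get(rowIndex);
  }

  /** Three rows where only row 1 is marked null (bitmap byte 0b0000_0010). */
  static LazyNullAccessorSketch<String> sample() {
    return new LazyNullAccessorSketch<>(Arrays.asList("a", "", "c"), new byte[] {0b0000_0010});
  }
}
```

One trade-off worth weighing: once the `BitSet` has been built, this variant holds both the raw `byte[]` and the `BitSet`, so the saving applies only to columns whose values are never read.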
## Description
This PR introduces lazy loading support for inline Arrow results to
improve memory efficiency when handling large result sets.
Previously, InlineChunkProvider would eagerly fetch all arrow batches
upfront when results had hasMoreRows = true, which could lead to memory
issues with large datasets. This change splits the handling into two
separate paths:
1. Lazy path (new): For Thrift-based inline Arrow results (when
ARROW_BASED_SET is returned), we now use LazyThriftInlineArrowResult
which fetches arrow batches on-demand as the client iterates through
rows. This is similar to how LazyThriftResult works for columnar data.
2. Remote path (existing): For URL-based Arrow results (URL_BASED_SET),
we continue using ArrowStreamResult with RemoteChunkProvider which
downloads chunks from cloud storage.
The InlineChunkProvider is now used only for SEA results with JSON_ARRAY
format and INLINE disposition, which contain all data inline (no
hasMoreRows flag set).
This will reduce memory consumption and improve performance when dealing
with large inline Arrow result sets, similar to the improvement in
#975.
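The on-demand fetching described for the lazy path can be sketched as an iterator that requests the next batch only after the current one is consumed. This is a minimal illustration, not the code of LazyThriftInlineArrowResult: `LazyBatchIteratorSketch`, `BatchSource`, `fetchNextBatch`, and `fetchCount` are hypothetical names, rows are plain strings instead of Arrow batches, and an empty batch stands in for the server reporting no more rows.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: iterate rows while holding at most one batch in memory. */
final class LazyBatchIteratorSketch implements Iterator<String> {
  /** Hypothetical stand-in for a server fetch call. */
  interface BatchSource {
    /** Returns the next batch, or an empty list once there are no more rows. */
    List<String> fetchNextBatch();
  }

  private final BatchSource source;
  private List<String> current = Collections.emptyList();
  private int position = 0;
  private boolean exhausted = false;
  int fetchCount = 0; // exposed only so the laziness can be observed

  LazyBatchIteratorSketch(BatchSource source) {
    this.source = source;
  }

  @Override
  public boolean hasNext() {
    // Fetch only when the current batch is fully consumed.
    while (position >= current.size() && !exhausted) {
      current = source.fetchNextBatch();
      fetchCount++;
      position = 0;
      if (current.isEmpty()) {
        exhausted = true;
      }
    }
    return position < current.size();
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.get(position++);
  }

  /** Two in-memory batches ("r1", "r2") and ("r3") standing in for server responses. */
  static LazyBatchIteratorSketch sample() {
    Iterator<List<String>> batches =
        Arrays.asList(Arrays.asList("r1", "r2"), Arrays.asList("r3")).iterator();
    return new LazyBatchIteratorSketch(
        () -> batches.hasNext() ? batches.next() : Collections.<String>emptyList());
  }
}
```

In this sketch the previous batch becomes unreachable as soon as the next one is assigned, which is where the memory saving over eager fetching would come from.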
## Testing
- Unit tests
- Integration tests
- Manual testing
## Additional Notes to the Reviewer
Bypassing an existing failure on CI/CD because of 3e4f21c
Description
This PR also contains the changes from PR #966.
Introduce ColumnarRowView to provide direct access to columnar data without
materialising entire result sets into row objects. This reduces memory
allocations by allowing individual cell access via
`getValue(row, col)` instead of creating
`List<List<Object>>` structures.

Key changes:
This optimization maintains API compatibility while significantly reducing
memory overhead for large result sets.
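To make the difference from row materialisation concrete, here is a minimal sketch of the access pattern. It is not the driver's ColumnarRowView: `ColumnarViewSketch` and `sample()` are hypothetical names, and plain Java lists stand in for the Thrift column structures. The point it illustrates is that a cell is read from its column by index, so no per-row list is ever allocated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of cell access over column-oriented data. */
final class ColumnarViewSketch {
  private final List<List<?>> columns; // one list per column, all of equal length

  ColumnarViewSketch(List<List<?>> columns) {
    this.columns = columns;
  }

  int getRowCount() {
    return columns.isEmpty() ? 0 : columns.get(0).size();
  }

  int getColumnCount() {
    return columns.size();
  }

  /** Reads one cell directly from its column; no row object is created. */
  Object getValue(int row, int col) {
    if (col < 0 || col >= columns.size()) {
      throw new IndexOutOfBoundsException("column " + col);
    }
    List<?> column = columns.get(col);
    if (row < 0 || row >= column.size()) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    return column.get(row);
  }

  /** Two columns (ints and strings) with three rows; row 1 has a null string. */
  static ColumnarViewSketch sample() {
    List<List<?>> cols = new ArrayList<>();
    cols.add(Arrays.asList(1, 2, 3));
    cols.add(Arrays.asList("a", null, "c"));
    return new ColumnarViewSketch(cols);
  }
}
```

A caller would loop over `getRowCount()` and `getColumnCount()` and call `getValue(row, col)` per cell, instead of first copying every value into a `List<List<Object>>`.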
Building on the changes introduced in PR #966, the following improvements were
observed in a test that executes a SQL query retrieving 5 million rows:
Current heap usage over time:

Improved heap usage over time:

Testing
Additional Notes to the Reviewer